Predicting Diameter and Physical Harm of Asteroids using Machine Learning

Authors: Colin Campbell (c_c953), Leah Lewis (lrl68), Ryan Wakabayashi (rjw102) and Jake Worden (jrw294)

Abstract: Asteroid prediction is an ever-growing need with the increase in space travel and plans for space expansion. This project attempts to find the best performing models to predict the diameters of unknown asteroids and to classify them as physically hazardous. After evaluating models for both regression and classification, the Random Forest Regressor and Decision Tree Classifier were shown to have the highest performance in accordance with our target success goals. Therefore, with some further work and tuning, these models could provide a potentially powerful means of identifying unknown small-bodies for planetary defense.

Introduction

The ability to use the collected data on nearby asteroids to determine whether they present a threat to life on Earth is critical for the future of civilization. Organizations like NASA and SpaceX are currently working on technologies to identify these celestial threats and their trajectories. As the amount of data we can collect increases, it is imperative that we determine which attributes are key to identifying how hazardous an asteroid is to Earth. The goal of this project is to use machine learning techniques to accurately predict an asteroid's diameter and to predict whether an asteroid is physically hazardous to life on Earth based on features of the unknown asteroid.

Problem Statement

Given a dataset of asteroid features, can machine learning be used to predict an unknown asteroid's diameter and determine whether it is physically hazardous, in accordance with the following success measures?

| ML Approach | CV Score | F1 | Precision | Recall | R2 | MSE | MAPE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Predicting Diameter | >85% | - | - | - | >85% | ≤ 1.2 | ≤ 25 |
| Classifying Physically Hazardous Asteroid | ≥80% | ≥80% | ≥80% | ≥80% | - | - | - |

Each of the following Kaggle projects attempted to apply machine learning to predict the diameter of asteroids and served as a solid benchmark for this project:

| Kaggle Project Title | Project Author | Project Description | Project Results | Link to Project |
| --- | --- | --- | --- | --- |
| Asteroid Diameter Estimators | Liam Toran | Supervised Regression Targeting Diameter | R2: 84.5% (XGBoost) | See Kaggle |
| Asteroid Diameter Prediction | TitanPointe | Evaluate MSE of ML Models for Asteroid Prediction | MSE: 12 (Random Forest) | See Kaggle |

Data Management

Data Gathering

Links to the database and dataset used can be found below:

Small-Body DataBase Link: Jet Propulsion Laboratory Solar System Dynamics

Open Asteroid Dataset Link: Asteroid_Updated.csv

Open Asteroid Dataset Description:

  1. The dataset was created on behalf of NASA by the Jet Propulsion Laboratory (JPL) at the California Institute of Technology's "Solar System Dynamics" (SSD) group. One of the primary responsibilities of the SSD group is to maintain the Small-Body DataBase (SBDB), which is comprised of information relating to the orbits, physical parameters, discovery circumstances and hazard assessments for all known small-bodies in our solar system. In this context, small-bodies are defined as comets and asteroids, where "asteroids" includes Kuiper-belt objects (TNOs) and dwarf planets. This database is actively kept up-to-date, meaning that as new data becomes available for both new and existing small-bodies, new orbits are automatically computed, typically within an hour or two.
  2. The dataset used here was gathered from the SSD's SBDB via the Open Asteroid Dataset challenge posted on Kaggle.
  3. The dataset itself is composed of various instances of small-bodies, along with their respective orbital elements. A summary of each of these elements or features, along with a brief description, can be seen in the following table:
| Feature | Description |
| --- | --- |
| a | Semi-major axis (au) |
| e | Eccentricity |
| i | Inclination with respect to the x-y ecliptic plane (deg) |
| om | Longitude of the ascending node |
| w | Argument of perihelion |
| q | Perihelion distance (au) |
| ad | Aphelion distance (au) |
| per_y | Orbital period (years) |
| data_arc | Data arc-span (d) |
| condition_code | Orbit condition code |
| n_obs_used | Number of observations used |
| H | Absolute magnitude parameter |
| neo | Near Earth Object |
| pha | Physically Hazardous Asteroid |
| diameter | Diameter of asteroid (km) |
| extent | Object bi/tri-axial ellipsoid dimensions (km) |
| albedo | Geometric albedo |
| rot_per | Rotation period (h) |
| GM | Standard gravitational parameter (product of mass and the gravitational constant) |
| BV | Color index B-V magnitude difference |
| UB | Color index U-B magnitude difference |
| IR | Color index I-R magnitude difference |
| spec_B | Spectral taxonomic type (SMASSII) |
| spec_T | Spectral taxonomic type (Tholen) |
| G | Magnitude slope parameter |
| moid | Earth minimum orbit intersection distance (au) |
| class | Asteroid orbit class |
| n | Mean motion (deg/d) |
| per | Orbital period (d) |
| ma | Mean anomaly (deg) |

The data gathering phase will attempt to answer the following questions regarding the dataset:

  1. How many data entries are there and what datatypes are present?
  2. How many null values, if any, are present in the dataset?
  3. How much memory is used by the dataset?

Importing all libraries for data gathering

Read the csv file using pandas read_csv() method and print the first five entries

Use pandas shape attribute to identify the amount of data available

Use pandas info method to identify the data types, null values and memory usage
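The gathering steps above can be sketched as follows. Since the real Asteroid_Updated.csv is not reproduced here, a few synthetic rows stand in for the file; the column names follow the feature table above.

```python
import io
import pandas as pd

# A few synthetic rows standing in for Asteroid_Updated.csv, which in the
# real dataset contains 839,714 entries with 31 features.
csv_data = io.StringIO(
    "a,e,i,diameter,pha\n"
    "2.77,0.08,10.6,939.4,N\n"
    "2.36,0.23,34.8,,N\n"
    "1.46,0.22,10.8,0.37,Y\n"
)
df = pd.read_csv(csv_data)

print(df.head())   # first five entries
print(df.shape)    # (rows, columns)
df.info()          # dtypes, non-null counts, memory usage
```

Against the full CSV, the first line would simply be `df = pd.read_csv("Asteroid_Updated.csv")`.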

Data Gathering Results

From the data gathering, the following was obtained in relation to the initial inquiries regarding the dataset:

  1. There are a total of 839,714 data entries each with 31 features to explore. Of these features there are 21 float64, 1 int64, and 9 object data types.
  2. Some features contain 0 null values while others contain more than 700,000 nulls.
  3. The data's memory usage is approximately 198.6MB.

Based on this information, the data will need to undergo some extensive pre-processing prior to any exploratory data analysis.

Data Pre-processing, Cleaning, Labeling, and Maintenance

Initial data gathering showed that the dataset is comprised of 839,714 entries consisting of 31 variables made up of 3 different data types (float64, int64, object). Additionally, it was determined that while some features contained no null values, others contained more than 700,000. Since the goal of this project is to create both a regressor and a classifier, targeting diameter and physically hazardous asteroids respectively, there is a need to address the frequency of null value occurrences throughout the dataset.

This phase of data exploration will attempt to address the following concerns:

  1. What features have a high number of Nulls? How will these null values be addressed for both regression and classification?
  2. How will the incorrect data types be handled for regression?
  3. How will any class imbalance be handled for classification?

Importing all libraries for data pre-processing, cleaning, labeling and maintenance

Answering Q1 from Data Pre-processing, Cleaning, and Maintenance

The results of the initial data gathering showed the existence of a large quantity of null values within the dataset. The pandas module can be used to print the sum of null values to determine which columns have a high percentage of nulls. If either of the targets (diameter, pha) is present amongst these columns, additional data cleaning will be needed prior to any exploratory data analysis. Otherwise, any column found to contain a large frequency of null values can be discarded from the dataframe, as it will not be useful for either regression or classification.

Use pandas isnull and sum methods to print the sum of null values within the dataframe

As shown in the above cell output, 12 of the 31 features contained about 700,000 null values. Amongst these 12 features was diameter, the target for regression, with a total of 702,078 null values. Removing the entries with null values for diameter would greatly diminish the size of the dataset, therefore only the remaining 11 features will be dropped from the dataframe for now.

Use pandas drop method to drop the features from the dataframe

Now that the 11 features containing more than 700,000 null values have been dropped from the dataframe, pandas can be used again to determine the remaining number of null values needing to be processed prior to exploratory data analysis.

Use pandas isnull and sum methods to print the sum of null values within the dataframe
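The null inspection and drop described above can be sketched as follows; a tiny synthetic frame with two null-dominated columns stands in for the real data, and the threshold used here is illustrative.

```python
import pandas as pd

# Synthetic stand-in: two features dominated by nulls, one mostly complete.
df = pd.DataFrame({
    "a": [2.77, 2.36, 1.46, 3.01],
    "GM": [62.6, None, None, None],
    "BV": [0.71, None, None, None],
    "diameter": [939.4, None, 0.37, None],
})

null_counts = df.isnull().sum()
print(null_counts)

# Drop every high-null feature except the regression target, diameter.
high_null = [c for c in null_counts[null_counts > 2].index if c != "diameter"]
df = df.drop(columns=high_null)
print(df.isnull().sum())
```

In the real notebook the same pattern applies, with the 11 high-null features dropped by name and diameter retained.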

Q1 Results

Voila!✨ From the above cell it is shown that the dataframe contains significantly fewer null values, and the remaining data is now constrained only by the null values present in the target features. Any further processing will need to address the specific data needs of either classification or regression.

Answering Q2 from Data Pre-processing, Cleaning, and Maintenance

During data gathering, 9 categorical features of type object were observed within the dataset. Of these 9 features, 5 were dropped during the initial cleaning, and the remaining 4 features (pha, neo, condition_code and class) contain only categorical data. In order for these remaining 4 features to be used by a machine learning algorithm, they must first be converted from categorical to numerical data types. After converting these features, any remaining data of incorrect type must also be converted to numeric values before dropping the remaining null values from the dataframe.

Create separate dataframe to be used for regression

Currently the values belonging to both the pha and neo features can be either 'Y' or 'N', while the values belonging to the condition_code feature range from 0 to 9, plus 'D' and 'E'. By using the map method provided by the pandas module, each value within a provided dictionary can easily be mapped to a corresponding numerical value.

Use dictionaries along with pandas map method to transform categorical data to numerical data

Results from the above transformation show the pha and neo features now containing either 0 or 1, and the condition_code feature values ranging from 1 to 12.
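The mapping step might look like the sketch below; the exact dictionaries are assumptions based on the value ranges described above (Y/N flags to 1/0, condition codes 0-9 plus 'D' and 'E' to 1-12).

```python
import pandas as pd

df = pd.DataFrame({
    "pha": ["N", "Y", "N"],
    "neo": ["Y", "Y", "N"],
    "condition_code": ["0", "5", "D"],
})

# Map Y/N flags to 1/0 and the 0-9/D/E condition codes to 1-12.
yn_map = {"N": 0, "Y": 1}
cc_map = {str(i): i + 1 for i in range(10)}
cc_map.update({"D": 11, "E": 12})

df["pha"] = df["pha"].map(yn_map)
df["neo"] = df["neo"].map(yn_map)
df["condition_code"] = df["condition_code"].map(cc_map)
print(df)
```

Any value missing from the dictionary would be mapped to NaN, which the later dropna step removes.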

Unlike the values belonging to the 3 features converted in the above cell, the values belonging to the class feature do not share any sort of sequential meaning or state with one another. Therefore, a label encoder can be used to map the feature values to arbitrary numerical representations.

Use Label Encoder to transform categorical data to numerical data
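A minimal label-encoding sketch for the class feature follows; the orbit class labels shown are illustrative examples, not the full set in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"class": ["MBA", "APO", "TNO", "MBA"]})

# LabelEncoder assigns an arbitrary integer code to each distinct label.
le = LabelEncoder()
df["class"] = le.fit_transform(df["class"])
print(df["class"].tolist())  # integer codes
print(list(le.classes_))     # the original labels, sorted
```

Note that the codes carry no ordinal meaning, which is exactly why a label encoder is appropriate here.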

Now that each of the features initially containing categorical objects has been transformed to a numerical data type, pandas can be used on the dataframe to change any non-convertible values to NaN within each feature. By changing non-numeric values to NaN, it is ensured that after the final drop of null values the dataframe will only contain numeric values.

Use pandas apply and to_numeric methods to change any non-convertible values to NaN

From here the dataframe should only consist of numeric, NaN or null values. Therefore, the features containing only null values will now be dropped from the dataframe. Additionally, any entries containing any null values will also be removed.

Use pandas dropna method to remove null values from the dataframe
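The coerce-then-drop sequence can be sketched on a toy frame as follows.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["2.77", "bad", "1.46"],  # one non-convertible string
    "H": [3.34, None, 16.9],
})

# Coerce any non-convertible value to NaN, then drop incomplete rows.
df = df.apply(pd.to_numeric, errors="coerce")
df = df.dropna()
print(df)
```

After this, every remaining value is numeric and there are no nulls left to interfere with modeling.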

To verify that the dataframe for regression is now cleaned, pandas can be used to show that there are currently no null values within the dataset and that all of the values within the dataset are of numeric types.

Use pandas isnull and sum methods to print the sum of null values within the dataframe

Use pandas dtypes attribute to show the catalog of data types in the dataframe

The dataset for regression is now cleaned and ready for some exploratory data analysis! But before taking a deep dive into it, it may be useful to see the final size of the dataset that's going to be used.

Use pandas shape attribute to identify the dimensions of the dataframe

Q2 Results:

Wowza!✨ Following the data cleaning for regression, the dataset went from containing 839,714 entries with 31 features to only containing 127,910 entries with just 20 features! Now that the dataset for regression is finalized, exploratory data analysis will help determine which of these 20 features, if any, can be used to model asteroid diameter.

Answering Q3 from Data Pre-processing, Cleaning, and Maintenance

One major hurdle to overcome when pre-processing data for classification is the possible existence of class imbalances. Class imbalance refers to a higher number of instances of one particular class, which leads to biases in classifiers. These biases toward the majority class can then result in bad classifications of the minority class. For example, if the dataset originally had a distribution of 499,000 instances of class 0 and 1,000 instances of class 1, then the classification scores would be based mostly upon how accurate the classifier is at predicting instances of class 0. Therefore, before handling any non-numeric and null datatypes in the datasets, it is best to address any significant class imbalances present within the dataframe.

Create separate dataframe to be used for classification

Prior to handling any non-numeric datatypes and null values in the datasets, pandas can be used to determine whether the classifier's target feature, pha, has a significant class imbalance.

Use pandas groupby method to output the class size distribution
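The distribution check can be sketched as below; the counts here are synthetic stand-ins for the real 821,257 'N' vs 2,015 'Y' split.

```python
import pandas as pd

# Synthetic stand-in with a deliberate class imbalance.
df = pd.DataFrame({"pha": ["N"] * 8 + ["Y"] * 2, "H": range(10)})

# Size of each pha class.
print(df.groupby("pha").size())
```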

The results from the above cell show that the pha feature's 0 class has 821,257 instances, while the 1 class has only 2,015. This significant disproportion, if not adjusted, will lead to a high bias toward the majority class, 0, and give falsely high accuracy scores. To solve this problem, a Variational Autoencoder (VAE) will be used to generate 6,000 extra samples for the minority class.

After using the VAE.ipynb included within this repository, new data was appended to the original csv and saved as a new csv.

Read the new csv file created by the VAE using pandas read_csv() method and print the first five entries

After reading the new csv generated by the VAE, pandas can be used to recheck the class size distribution of pha to verify the existence of the new samples belonging to the minority class.

Use pandas groupby method to output the class size distribution

The new class distribution for pha shows the addition of a new '1.0' class comprised of the 6,000 values generated by the VAE. This discrepancy can be resolved by using the same approach taken when cleaning the dataframe for regression, except this time the pha feature value '1.0' will also be mapped to '1' along with the 'Y' values. Additionally, the mapping of the other categorical datatypes to numerical datatypes, as seen when cleaning the regression dataframe, can be completed at this time.

Use dictionaries along with pandas map method to transform categorical data to numerical data

Use Label Encoder to transform categorical data to numerical data

This portion of code visualizes what we originally had as our classification data. We noticed that we were left with a very small number of samples for class 1, so we tried dropping other columns with higher null counts to see if they were the cause of the drops.

By dropping none of the features with high null counts we are left with an empty dataset; therefore we need to drop at least those with majority nulls, as we did in the regression section.

In the above output there are only 207 samples for class 1. By dropping the features diameter, w, per, and ma, we found that the data's integrity was better maintained and only a few samples were dropped from class 1.

Use pandas drop method to drop the features from the dataframe

Now that all unnecessary columns have been dropped from the dataframe and all incorrect datatypes have been corrected, the remaining features containing only null values can be dropped from the dataframe as well as any entries containing any null values.

Use pandas dropna method to remove null values from the dataframe

To verify that the dataframe for classification is now cleaned, pandas can be used to show that there are currently no null values within the dataset and that all of the values within the dataset are of numeric types.

Use pandas isnull and sum methods to print the sum of null values within the dataframe

Use pandas dtypes attribute to show the catalog of data types in the dataframe

The classification dataframe now contains 0 null values across all 15 features, and all features have numeric data types. Although this was a sufficient stopping point for regression, following the creation of the additional 6,000 minority class values for pha there still exists a significant class imbalance needing to be addressed. The pandas module can again be used to display the current class distribution of pha within the classification dataframe following the dropping of null values.

Use pandas groupby method to output the class size distribution

As seen in the above cell, the pha feature's class distribution following the data cleaning is as follows: 818,167 instances of class 0 and 8,013 of class 1. At this point, the resample method provided by the sklearn utility module can be used to downsample the majority class '0.0' to the frequency of the minority class '1.0'. This will remove the large bias encountered by our original, imbalanced dataset and give more accurate results.

Use sklearn resample method to downsample the majority class to match the minority class
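Downsampling with sklearn's resample can be sketched as follows on a synthetic imbalanced frame.

```python
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced frame: 50 majority (0) vs 5 minority (1) samples.
df = pd.DataFrame({"pha": [0] * 50 + [1] * 5, "H": range(55)})

majority = df[df["pha"] == 0]
minority = df[df["pha"] == 1]

# Downsample the majority class to the minority class size, without replacement.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=1
)
df_balanced = pd.concat([majority_down, minority])
print(df_balanced["pha"].value_counts())
```

The fixed random_state makes the downsampled subset reproducible between runs.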

Following the downsampling of the majority class, use pandas to see the resulting class distribution.

Print the counts of each classification for the feature pha

Finally the pha classes are balanced and we can continue to EDA for both datasets.

Use pandas shape method to identify the dimensions of the dataframe

Q3 Results:

Following the data cleaning and class balancing needed for a classifier, the dataset went from containing 839,714 entries with 31 features to only containing 16,026 entries with just 15 features!😲 While a dataset consisting of only 16,026 entries is not ideal, it ensures that any classification model made will contain no bias towards a particular class. Now that the dataset for classification is finalized, exploratory data analysis can help determine which of these 15 features, if any, can be used to determine whether an unknown asteroid could be considered physically hazardous.

Exploratory Data Analysis

Now that the dataframes for both regression and classification have been cleaned, visualization and feature selection tools can provide both graphical and mathematical insight into which features, if any, have the strongest relationships to the targets. A seaborn pairplot displays the pairwise relationships between features and the target values. A correlation heatmap visualizes the strength of correlation, positive or negative, between features and targets, where correlation values range from -1 to 1. Analysis of Variance (ANOVA) is a popular statistical tool that helps determine whether the differences between groups of data are statistically significant; the F-score calculated within ANOVA is the ratio of explained variance to unexplained variance.

Both the regression and classification dataframes will separately undergo the same exploratory data analysis process.

Import all libraries for exploratory data analysis

Exploratory Data Analysis for Regression

The first method that will be visualized is a seaborn pairplot of the regression dataframe. This will visualize the features' correlations in a graphical format so that patterns may be recognized.

From the pair plot above, some of the features can be selected for modeling diameter prediction. The pair plot allows the important features to be selected prior to modeling. As seen above, features like Orbital period (per_y), Absolute magnitude parameter (H), Perihelion distance (q), and Mean motion (n) should be selected as predictors of diameter because a pattern can be visually seen.

Next a heatmap of the regression dataframe will be created. This heatmap will focus on the correlation of the Diameter (diameter) feature.

We will select the highest correlation values with respect to diameter in the heatmap generated from the regression data frame for modeling.
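The correlation-based selection underlying the heatmap can be sketched with pandas alone (seaborn's heatmap would only visualize this matrix); the synthetic data below stands in for the real frame, with diameter loosely anti-correlated with H as observed in the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
h = rng.normal(size=200)
# Synthetic stand-in: diameter anti-correlated with H; e is pure noise.
df = pd.DataFrame({
    "H": h,
    "diameter": -0.6 * h + rng.normal(scale=0.5, size=200),
    "e": rng.normal(size=200),
})

# Correlation of every feature with the target, then threshold on |r|.
corr = df.corr()["diameter"].drop("diameter")
selected = corr[corr.abs() >= 0.5]
print(selected)
```

Only features clearing the |0.5| threshold survive; in the real frame that leaves data_arc and H, as listed below.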

Features with correlation of at least |0.5|:

  1. Data arc-span (data_arc) [0.5]
  2. Absolute Magnitude (H) [-0.58]

Unfortunately, these correlation values are not very strong, meaning that diameter prediction will be difficult to achieve while using the regression dataset. Because of this, maintaining as much data as possible will be our main goal for this dataset. Hopefully with more data, the lower correlations can still be of use in predictions.

Working with large data sets and multiple models is expensive in processing power. Reducing dimensions by doing feature selection can cut down on this processing overhead. The seaborn pairplot and heatmap methods have allowed for a handful of features to be selected from the regression data set. These features can now be used for modeling, but it is helpful to check what features other methods may select. In order to ensure no useful features are left out ANOVA will be used. The scikit-learn statistical tool ANOVA, or SelectKBest specifically, can be used to automatically score how well a feature correlates to the target.

Set the target feature 'diameter' to y_regression and the other remaining features to x_regression

Now that x_regression and y_regression data sets contain values, an ANOVA can be performed using SelectKBest. The fit_transform method will transform x_regression to only contain our 10 best features from the dataset.

Use the SelectKBest function to create the ANOVA, then use the fit_transform method to acquire the ANOVA scores. Print the result

Higher values are desired for feature selection from the above ANOVA.
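A minimal SelectKBest sketch follows, on synthetic data where only the first two columns drive the target (k is set to 2 here rather than the notebook's 10, to match the toy data).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
# The target depends on the first two columns only.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Univariate F-test scoring for a continuous target.
selector = SelectKBest(score_func=f_regression, k=2)
X_best = selector.fit_transform(X, y)

print(selector.scores_)                    # one F-score per feature
print(selector.get_support(indices=True))  # indices of the k best features
print(X_best.shape)                        # only the k best columns remain
```

fit_transform both scores the features and returns the reduced matrix, which is how the notebook keeps only its 10 best features.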

Regression Exploratory Data Analysis Results


Based on the ANOVA, heatmap, and pairplot we can see that there are multiple features that have high importance when determining diameter. The final 10 features that will be used for the regression models are listed in the following table from highest to lowest ANOVA score (ANOVA values may vary. Listed values are from a sample execution).

| Feature | ANOVA score |
| --- | --- |
| per_y | 40.74103699750825 |
| per | 40.74103699750813 |
| ad | 32.88636968270681 |
| a | 31.637196697611962 |
| H | 17.841635817082636 |
| moid | 11.167513458143228 |
| q | 11.095961249580835 |
| neo | 10.855769669869455 |
| n | 9.20323399210887 |
| class | 9.089789442546143 |

These features were chosen based on the visualizations and confirmed with the ANOVA. The pairplot showed how all of the features correlated to each other, but it was very large and slightly hard to read. With the ANOVA values, we can automatically retain the features with the largest impact on our target variable, diameter, which should lead to better predictions.

Exploratory Data Analysis for Classification

The first method that will be visualized is a seaborn pairplot of the classification dataframe. This will visualize any relationships between features in a graphical format so that patterns may be easily recognized.

The physically hazardous asteroid target is binary ('Y' or 'N'), so relationships between the features and the target are difficult to distinguish in a pairplot. This makes utilization of a heatmap much more important for visualization.

Next a heatmap on the classification data frame will be created. This heatmap will focus on correlation of the Physically Hazardous Asteroid pha feature.

From the heatmap above, multiple features can be visually identified to have a strong correlation with the pha feature.

Features with the strongest correlations:

  1. Near Earth Object (neo) [0.98]
  2. Asteroid orbit class (class) [-0.97]
  3. Eccentricity (e) [0.85]
  4. Absolute magnitude parameter (H) [0.68]

The features listed above may prove to be useful when attempting to predict pha.

Working with large data sets and multiple models is expensive in processing power. Reducing dimensions by doing feature selection can cut down on this processing overhead. The seaborn heatmap has allowed for a handful of features to be selected from the classification data set. These features can now be used for modeling, but because the pairplot did not provide any insight into feature relationships, ANOVA will be used like it was with the regression data set.

Set the target feature 'pha' to y_classification and the other remaining features to x_classification

Now that the x_classification and y_classification data sets contain values, an ANOVA can be performed using the sklearn SelectKBest function. The fit method compares x_classification and y_classification, which allows a final dataset containing the resultant values to be created. We will not use fit_transform for this dataset because we want to visualize how impactful the features are and perform feature selection separately.

Use the SelectKBest function to create the ANOVA, then use the fit method to acquire the ANOVA values. Print the result

Higher values are desired for feature selection from the above ANOVA.

Classification Exploratory Data Analysis Results

Based on the ANOVA, heatmap, and pairplot we can see that there are multiple features that have high importance when classifying pha. The final 10 features that will be used for the classification models are listed in the following table from highest to lowest ANOVA score (ANOVA values may vary. Listed values are from a sample execution).

| Feature | ANOVA score |
| --- | --- |
| neo | 339015.87 |
| class | 213978.95 |
| e | 40291.96 |
| H | 14018.56 |
| n | 11261.44 |
| q | 3635.74 |
| moid | 2722.93 |
| i | 2339.11 |
| om | 234.89 |
| n_obs_used | 153.41 |

These features were chosen from visualization and confirmed with the ANOVA. Because ANOVA gives a concrete numeric value, its results are weighted more heavily than the visualizations.

Exploratory Data Analysis Results:

The key features have been selected using both the visualization and ANOVA methods. Now that these features have been identified, machine learning approaches can be used to model the data. The 10 selected features for regression and classification are displayed in the following table. Several algorithms will be tested on both data sets; the goal is to find out which algorithms provide the most desirable results and why they work.

| No. | Regression | Classification |
| --- | --- | --- |
| 1 | per_y | neo |
| 2 | per | class |
| 3 | ad | e |
| 4 | a | H |
| 5 | H | n |
| 6 | q | q |
| 7 | moid | moid |
| 8 | neo | i |
| 9 | n | om |
| 10 | class | n_obs_used |

Machine Learning Approaches

We tried multiple models for regression. When it came to parameter tuning, some took an excessive amount of resources, so we chose to look elsewhere. If a model performed badly after grid search and 10-fold cross-validation, we looked into more data and other methods of improvement, but ultimately found other models that performed well with less tuning and lower computational cost.

All attempted models

| Algorithm | Supervised? | Regression or Classification? | Reasoning |
| --- | --- | --- | --- |
| Random Forest | Supervised | Regression | High performing model, well known |
| K-Nearest Neighbor | Supervised | Regression | Simple data-driven model |
| Stochastic Gradient Descent | Supervised | Regression | Good for large amounts of data |
| Gradient Boosting Regressor | Supervised | Regression | High performing model |
| Lasso | Supervised | Regression | Feature selection |
| Support Vector Machine | Supervised | Regression | Can model both linear and non-linear relationships between variables |
| Logistic Regression | Supervised | Classification | Base model for predicting probability of target |
| Support Vector Machine | Supervised | Classification | Works well for low-dimensional data |
| Decision Tree | Supervised | Classification | Easy to compute and explain implementation |

Regression Models for Predicting diameter

Diameter is a continuous target and thus requires a regression model. We decided to use the K-Nearest Neighbors, Epsilon-Support Vector, Gradient Boosting, and Random Forest algorithms to predict the diameter. We followed these steps to ensure the best results were produced:

  1. Split the data using the selected features.
  2. Standardize the data with the standard scaler if required.
  3. Run the algorithm using default parameters to output a base score.
  4. Run grid search for hyper-parameter tuning.
  5. Input returned parameters from grid search into the model for optimal results.
  6. Output optimal score, mean-squared error, and mean absolute percentage error.
  7. Run cross-validation and output the mean score.
  8. Visualize a scatterplot of the prediction vs the actual diameter.

Importing all libraries for regression models

Split the data into training and testing sets using test_size = 0.2 and random_state = 1. This ensures that 80% of the dataset will be used for the training set, while the remaining 20% will be used as the testing set.

Use sklearn train_test_split method to split the data into training and testing sets
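The split can be sketched as follows; the arrays here are placeholders for the selected-feature matrix and the diameter target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix (50 samples, 2 features) and target.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80/20 split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(x_train.shape, x_test.shape)
```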

KNN Regressor

KNN is a simple data-driven model. It is a lazy algorithm that requires no training phase, but has proven useful in real-world predictions by approximating a continuous target from the provided data.

Use sklearn KNeighborsRegressor fit method to train a KNN model

The KNN score from above is the coefficient of determination R2, which has a maximum of 1.0. For the KNN regressor with default parameters, the score was only .66, which makes the current model unreliable and does not meet our goal measures. We now look for ways to improve upon this base model by hyper-parameter tuning with sklearn's GridSearchCV.

Use sklearn GridSearchCV fit method for parameter tuning

After running the grid search with the parameter grid above, the following parameters were selected as the best performing:

The KNN model will now use the best performing parameters from the grid search to make an optimal model and print its results.

Use sklearn KNeighborsRegressor fit method to train a KNN model

This optimal KNN model still does not meet our goals for regression. An R-squared of 63.6% means our data does not fit well with this model.

K-fold cross-validation is a powerful statistical method that estimates the skill of a model against unseen data. This method of cross-validation shuffles the data, splits it into k groups, removes one group, trains the model, and tests the model against the removed group.

Use sklearn cross_val_score method to validate the model with k-folds
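The full KNN workflow above (baseline fit, grid search, refit with the best parameters, 10-fold cross-validation) can be sketched end-to-end; the data and the parameter grid below are illustrative, not the notebook's actual grid.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Baseline model with default parameters.
base = KNeighborsRegressor().fit(x_train, y_train)
print("base R2:", base.score(x_test, y_test))

# Hyper-parameter tuning; this grid is illustrative.
grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 9], "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(x_train, y_train)
print("best params:", grid.best_params_)

# Refit with the best parameters, then 10-fold cross-validate.
best = KNeighborsRegressor(**grid.best_params_).fit(x_train, y_train)
scores = cross_val_score(best, x_train, y_train, cv=10)
print("optimal R2:", best.score(x_test, y_test))
print("mean CV score:", scores.mean())
```

The same skeleton is reused for the other regressors, swapping only the estimator and the grid.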

The optimal KNN cross-validation score along with the other measures shows that KNN regressor is not a reliable prediction model for this dataset.

Use Matplotlib scatter method to create a scatter plot of Actual vs Predicted Diameter

Comparing the scatter plot above to the reference line, which represents a perfect prediction (100% R-squared), we see that this model tends to under-predict diameters between 100 and 300. There are also a couple of outliers far above the line that were over-predicted. Overall, this is not the model we want to choose to predict diameter.

Epsilon-Support Vector Regression (SVR)

Support Vector Regressor has the ability to model both linear and non-linear relationships between variables.

The selected feature data must be scaled to avoid features with larger ranges from dominating the other features.

z = (x - u) / s where x is the sample, u is the mean, and s is the standard deviation of a feature

Use StandardScaler fit_transform method to scale the data

The selected feature data is now scaled into x_train_std and x_test_std and is ready to be used.
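The scaling step can be sketched as follows; the small arrays stand in for the real training and testing splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder splits with features on very different scales.
x_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
x_test = np.array([[2.0, 300.0]])

scaler = StandardScaler()
# Fit on training data only, then apply the same transform to the test data.
x_train_std = scaler.fit_transform(x_train)
x_test_std = scaler.transform(x_test)

print(x_train_std.mean(axis=0))  # ~0 per feature
print(x_train_std.std(axis=0))   # ~1 per feature
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into training.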

Use sklearn SVR fit method to train a SVR model

The SVR model with base parameters only scored .66. This score does not meet our goal measure, thus we move to hyper-parameter tuning with GridSearchCV.

Use sklearn GridSearchCV fit method for parameter tuning

Parameter tuning for SVR will be accomplished with GridSearchCV on parameters C and epsilon, each ranging from 0.001 to 100 in multiples of 10.
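A sketch of what this grid search might look like, on synthetic stand-in data (the grid values follow the description above; everything else is illustrative):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled feature matrix and diameter target
rng = np.random.RandomState(1)
X = rng.rand(100, 3)
y = X.sum(axis=1) + rng.normal(0, 0.05, 100)

# C and epsilon from 0.001 to 100 in multiples of 10
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "epsilon": [0.001, 0.01, 0.1, 1, 10, 100],
}
grid = GridSearchCV(SVR(), param_grid, cv=10, scoring="r2")
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```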

SVR was selected after first attempting Stochastic Gradient Descent (SGD) regression. After over 72 hours of parameter tuning, approximately 0.40 was the highest R2 score achieved. SGD regression on this dataset required max_iter to be raised from the default of 1,000 to 1,000,000 to ensure convergence occurred. The epsilon and eta0 parameters had almost no impact on the SGD regression score, but a slight change in alpha caused the score to jump from approximately 0.40 to an unrealistically large value.

Grid search returned the best performing parameters:

The grid search will output the best performing parameters and we will use them to make an optimal SVR model and print the results.

Use sklearn SVR fit method to train an SVR model

The optimal SVR score falls short of the set R2 goal. A score of 76.4% indicates that the model is not reliable enough to make accurate predictions.

We now use cross-validation to ensure consistency with our previous R2 score.

Use sklearn cross_val_score method to validate the model with k-folds

The mean cross-validation score was consistent with the previous SVR score at 76%. This supports that SVR is not a good model for this data.

Use Matplotlib scatter method to create a scatter plot of Actual vs Predicted Diameter

Gradient Boosting Regressor

Gradient Boosting Regressor (GBR), a boosted decision tree algorithm, was selected in the hope that its slower learning rate would yield higher accuracy when predicting the target. A GBR model combines many weak learners to achieve a single strong learner. The two main hyperparameters for GBR algorithms are n_estimators and learning_rate, where n_estimators is the number of trees added to the overall model and learning_rate is the rate at which the model learns. When setting these parameters, keep in mind that although performance can be expected to increase with a slower learning rate, more trees will then be needed to train the model. Thus, to prevent overfitting, a compromise must be made when choosing these hyperparameters.

To understand the model well, we first run a base GBR model using its default parameter values (n_estimators=100, learning_rate=0.1).

Use sklearn GradientBoostingRegressor fit method to train a GBR model
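A minimal sketch of such a cell, with synthetic stand-in data in place of the asteroid features (illustrative only):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asteroid feature matrix and diameter target
rng = np.random.RandomState(1)
X = rng.rand(200, 4)
y = X @ np.array([3.0, 1.5, 0.5, 2.0]) + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Default hyperparameters: n_estimators=100, learning_rate=0.1
gbr = GradientBoostingRegressor(random_state=1)
gbr.fit(X_train, y_train)
print(round(gbr.score(X_test, y_test), 3))  # R-squared on held-out data
```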

The base GBR score of 88.6% is much higher than the goal we established. This model is reliable and can be used to make predictions. We now look to see if we can raise the GBR score by hyper-parameter tuning with GridSearchCV.

Use sklearn GridSearchCV fit method for parameter tuning

After running the grid search with the parameter grid above, the following parameters were selected as the best performing

We can now use the best parameter found, n_estimators=105, to create an optimally tuned GBR model.

Use sklearn GradientBoostingRegressor fit method to train a GBR model

The optimal GBR score surpasses the set R2 goal. A score of 88.6% indicates that the model may be considered reliable enough to make accurate predictions.

We now use cross-validation to ensure consistency with our previous R2 score.

Use sklearn cross_val_score method to validate the model with k-folds

Our optimal GBR averages around 88.9% in the cross-validation, meaning we are not overfitting; since this is close to our other score, we can say that this model performs well for this dataset.

Use Matplotlib scatter method to create a scatter plot of Actual vs Predicted Diameter

Random Forest Regressor

The next model chosen was Random Forest. This model is known for being high-performing but can easily overfit, so it must be monitored carefully. The model is imported from scikit-learn's ensemble library. To understand the model well, we first run a base RF model with default parameters on the data.

Use sklearn's RandomForestRegressor fit method to train a RF model
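A minimal sketch of such a base run, on synthetic stand-in data (the names and data are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asteroid feature matrix and diameter target
rng = np.random.RandomState(1)
X = rng.rand(200, 4)
y = X @ np.array([3.0, 1.5, 0.5, 2.0]) + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

rf = RandomForestRegressor(random_state=1)  # all defaults, as in the base run
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(round(rf.score(X_test, y_test), 3), round(mean_squared_error(y_test, y_pred), 3))
```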

Wow! With default parameters, the Random Forest Regressor gave a score within our goals! The MSE is higher than we would want but is still relatively low. We will try running a parameter grid search to see if we can increase the score.

Use sklearn GridSearchCV fit method for parameter tuning

After running the grid search with the parameter grid above, the following parameters were selected as the best performing

The best score from the grid search wasn't much higher than our original score, but we will use its parameters in our optimal model to see the output and compare against the base.

Use sklearn RandomForestRegressor fit method to train a RF model

The optimal Random Forest model scores only slightly higher than our base, but all of the measures improved, which shows the parameter tuning was a success. To better validate the model, a 10-fold cross-validation will also be performed.

Use sklearn cross_val_score method to validate the model with k-folds

Our optimal Random Forest averages around 87.7% in the cross-validation, meaning we are not overfitting; since this is close to our other score, we can say that this model performs well for this dataset.

Use Matplotlib scatter method to create a scatter plot to visualize Actual vs Predicted Diameters

This scatter plot shows that, compared to the identity line (a perfect, 100% R-squared fit), we do quite well when predicting small diameters and start to slightly under- or over-predict in the 150-300 range. Beyond that, there is only one large outlier that was over-predicted. Compared to the other models, the Random Forest Regressor performs well and could be used to predict an asteroid's diameter.

Results from Regression

Model R-Squared MSE MAPE 10-Fold CV
KNN 63.7% 35.48 25.68% 63.47%
GBR 88.6% 11.17 23.53% 88.88%
SVR 76.4% 23.01 26.09% 76.12%
RF 88.9% 10.77 22.17% 87.73%

Our two best performing models are the Gradient Boosting Regressor and the Random Forest Regressor. Both achieved all of our goals for regression and therefore would be useful for predicting an unknown asteroid's diameter.

Classification models for predicting if an asteroid is hazardous:

The feature Physically Hazardous Asteroid (pha) has only two values, "Y" or "N". Therefore, for this target we used classification to assign the correct class to each asteroid. To get the best possible results we used the following steps:

  1. Split the data using the selected features.
  2. Select the top two features using Principal Component Analysis.
  3. Run a base algorithm to see a baseline for the prediction.
  4. Find more impactful features to boost the scores.
  5. Output classification report to see all performance measures.
  6. Visualize the confusion matrix for the model.
  7. Run cross-validation and output the mean score.

Importing all libraries for classification models

Split the data into training and testing sets using test_size = 0.2 and random_state = 1. This ensures that 80% of the dataset will be used for the training set, while the remaining 20% will be used for the testing set.

Use sklearn train_test_split method to split the data into training and testing sets
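A minimal sketch of the split, on synthetic stand-in data (illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the selected features and the binary pha target ("Y"/"N")
rng = np.random.RandomState(1)
X = rng.rand(100, 5)
y = np.where(rng.rand(100) > 0.5, "Y", "N")

# 80/20 split, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print(len(X_train), len(X_test))  # 80 20
```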

Logistic Regression

Logistic Regression is a simple, easy-to-use model, so it was chosen as the baseline for all of our classification models. This supervised algorithm outputs the probabilities of samples belonging to a certain class.

Use sklearn to create logistic regression model as well as output classification report and confusion matrix.
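A minimal sketch of such a cell, with a synthetic binary target standing in for pha (names and data are illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: features and a roughly balanced binary label
rng = np.random.RandomState(1)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class
```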

Support Vector Classification

This model is a powerful supervised algorithm that finds the decision boundary separating the classes with the maximal margin while allowing minimal classification error.

Use sklearn to create svc model and print the classification report and confusion matrix.
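A sketch of how the SVC and its regularization strength might be explored, on synthetic stand-in data (illustrative only):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected asteroid features and pha labels
rng = np.random.RandomState(1)
X = rng.rand(300, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# In sklearn's SVC, C is the inverse of the regularization strength:
# a smaller C means stronger regularization and a simpler boundary.
for C in (0.1, 1.0, 10.0):
    svc = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    print(C, round(svc.score(X_test, y_test), 3))
```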

Use higher regularization to help control our model capacity and hopefully attain better scores.

This model actually performed well compared to our baseline at ~77%. It is still below our targets, so other models are needed, since this model may already be at a local optimum.

Decision Tree Classifier

Since the tree-based Random Forest did exceptionally well in regression, we decided to give its building block, the Decision Tree, a chance in classification as well.

Use sklearn to create decision tree model as well as print classification report and confusion matrix.

This model's base performs as well as SVC's tuned model, giving hope that some tuning will increase its scores. Several smaller tests were run since the model executed quickly. For this model we ended up with a min_samples_leaf of 25 and a max_depth of 7.
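A sketch of how a tree with those tuned values might be constructed, on synthetic stand-in data (only min_samples_leaf=25 and max_depth=7 come from the text above; everything else is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected features and binary pha labels
rng = np.random.RandomState(1)
X = rng.rand(500, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Tuned values from the small manual search described above
dt = DecisionTreeClassifier(min_samples_leaf=25, max_depth=7, random_state=1)
dt.fit(X_train, y_train)
print(classification_report(y_test, dt.predict(X_test)))
```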

Awesome! We hit almost all of our success targets. The precision for class 1 is slightly under target, but everything performs well overall, and this model can be used to predict whether an unknown asteroid is physically hazardous.

Results from Classification

Model CV Score F1 Precision Recall
Logistic Regression 67.8% 68% 68% 68%
Support Vector Machine 75.3% 77% 78% 77%
Decision Tree 82.0% 83% 84% 83%

Experiments

Ideas for dealing with the class imbalance included upsampling, downsampling, and generating additional synthetic data. After attempting to upsample and downsample the raw data, we concluded that we needed more usable data rather than resampled copies of the existing data, and the VAE was developed to address this issue. As a simple sanity check, we compared the means and standard deviations of each feature to ensure the autoencoder was creating usable data that closely resembled the original data.

Grid Search: Every regression model used a grid search with 10-fold CV to do parameter tuning. From these we obtained our optimal models and printed our results.

Conclusion

Machine learning approaches were able to predict the diameter of an asteroid and also determine whether the asteroid is physically hazardous to Earth. The diameter was predicted using regression models. When analyzing the data, a large number of entries for diameter were null, which restricted the size of the dataset. Of the four regression models used, the best performing were Random Forest and Gradient Boosting; both surpassed the project goals, with scores of 88.89% and 88.56% respectively. Additionally, the goals were met for predicting whether an asteroid is considered physically hazardous to Earth. A classification approach was used for this prediction; initial scores were deceptively strong due to a class imbalance, which was addressed using a Variational Autoencoder. After this, the data size was limited, and the best scoring model was the Decision Tree with a cross-validation score of 82.6%. These results are promising both for the usability of machine learning approaches and for the prospect of building a system to predict the presence of a physically hazardous asteroid.

References

Machine Learning in Python (2021), Scikit-learn. Accessed: Oct. 7, 2021. [Online]. Available: https://scikit-learn.org/stable/

Matplotlib 3.5.0 (2021), Matplotlib. Accessed: Oct. 25, 2021. [Online]. Available: https://matplotlib.org/stable/index.html

NumPy (2021), NumPy. Accessed: Oct. 27, 2021. [Online]. Available: https://numpy.org/

Pandas (2021), Pandas. Accessed: Oct. 21, 2021. [Online]. Available: https://pandas.pydata.org/docs/reference/index.html

Seaborn 0.11.2 (2021), Seaborn. Accessed: Oct. 25, 2021. [Online]. Available: https://seaborn.pydata.org/

TensorFlow Core v2.7.0 (2021), TensorFlow. Accessed: Oct. 27, 2021. [Online]. Available: https://www.tensorflow.org/api_docs/python/tf

V. Basu. "Open Asteroid Dataset." (2019). https://www.kaggle.com/basu369victor/prediction-of-asteroid-diameter?select=Asteroid_Updated.csv

Appendix

Task Colin Leah Jake Ryan
Introduction
Problem Statement
Related Work
Data Management
Machine Learning Approaches
Regression Models
KNN
SVR
GBR
RF
VAE
Classification Models
Logistic Regression
SVC
DT
Experiments
Conclusion
References
Total Contribution 9 12 7 10